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Abstract 



Computer experiments are often performed to allow modeling of a response surface of a 
physical experiment that can be too costly or difficult to run except using a simulator. 
Running the experiment over a dense grid can be prohibitively expensive, yet running 
over a sparse design chosen in advance can result in insufficient information in parts of 
the space, particularly when the surface calls for a nonstationary model. We propose 
an approach that automatically explores the space while simultaneously fitting the 
response surface, using predictive uncertainty to guide subsequent experimental runs. 
The newly developed Bayesian treed Gaussian process is used as the surrogate model, 
and a fully Bayesian approach allows explicit measures of uncertainty. We develop an 
adaptive sequential design framework to cope with an asynchronous, random, agent- 
based supercomputing environment, by using a hybrid approach that melds optimal 
strategies from the statistics literature with flexible strategies from the active learning 
literature. The merits of this approach are borne out in several examples, including 
the motivating computational fluid dynamics simulation of a rocket booster. 
Key words: nonstationary spatial model, treed partitioning, sequential design, active 
learning 
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1 Introduction 



Many complex phenomena are difficult to investigate directly through controlled experi- 



ments. Instead, computer si mulation is 



insight into such phenomena (jSacks et al. 



j ecorn i ng a commonp 



1989 



Santner et al. 



ace a lternative to provide 



20031 ). However, the drive 



towards higher fidelity simulation continues to tax the fastest computers, even in highly 
distributed computing environments. Computational fiuid dynamics (CFD) simulations in 
which fiuid fiow phenomena are modeled are an excellent example — fiuid fiows over complex 
surfaces may be modeled accurately but only at the cost of supercomputer resources. In 
this paper we explore the problem of fitting a response surface for a computer model when 
the experiment can be designed adaptively, i.e., online — a task to which the Bayesian ap- 



proach is particu 



Chipman et al. 



arly well-suited. To do so, we will combine eleme nts from treed modeling 



2OO2I ) with modern Bayesian surrogate modeling (IKennedv and O 



200ll ). and elements of the sequentia. 



with active learning (IMacKay 



1992 



design of co mputer experiments (jSantner et al 



agan 



Cohn 



2OO3I) 



19961 ). The result is a fast and fiexible design 



interface for the sequential design of supercomputer experiments. 

Consider a simulation model which defines a mapping, perhaps non-deterministic, from 
parameters describing the inputs to one or more output responses. Without an analytic 
representation of this mapping, simulations must be run for many different input configu- 
rations in order to build up an understanding of its possible outcomes. Even in extremely 
parallel computing environments, computational expense of the simulation and/or high di- 
mensional input often prohibits the naive approach of running the experiment over a dense 
grid of input configurations. More sophisticated design strategies, such as a Latin Hypercube 
sample (LHS), maximin designs, orthogonal arrays, and maximum entropy designs can offer 
an improvement over gridding. Sequential versions of these are better still. However, such 
traditional approaches are "stationary" (or global, or uniform) in the sense they are based 
on a metric (e.g., distance) which is measured identically throughout the input space. The 



resulting designs are sometimes called "sparse" , or "space-filling" . Maximum entropy designs 
are literally stationary when based on stationary models (e.g., linear or Gaussian Process 
models). Such sparse, or stationary, design strategies are a mismatch when the responses 
necessitate a nonstationary model — common in experiments modeling physical processes, 
e.g., fluid dynamics — as they cannot learn about, and thus concentrate exploration in, more 
interesting or complicated regions of the input space. 

For example, NASA is developing a new re-usable rocket booster called the Langley 
Glide-Back Booster (LGBB). Much of its development is being done with computer mod- 
els. In particular, NASA is interested in learning about the response in flight characteristics 
(lift, drag, pitch, side-force, yaw, and roll) of the LGBB as a function of three inputs (speed 
in Mach number, angle of attack, and side slip angle) when the vehicle is re-entering the 
atmosphere. For each input configuration triplet, CFD simulations yield six response out- 
puts. Figure [U shows the lift response as a function of speed (Mach) and angle of attack 
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Figure 1: Lift plotted as a function of Mach (speed) and alpha (angle of attack) with beta 
(side-slip angle) fixed to zero. The ridge at Mach 1 separates subsonic from supersonic cases. 

(alpha) with the side-slip angle (beta) fixed at zero. The figure shows how the characteristics 
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of subsonic flows can be quite different from supersonic flows, as indicated by the ridge in 
the response surface at Mach 1. Moreover, the transition between subsonic and supersonic 
is distinctly non-hnear and may even be non-differentiable or non- continuous. The CFD 
simulations involve the iterative integration of inviscid Euler equations and are thus com- 
putationally demanding. Each run of the Euler solver for a given set of parameters takes 
on the order of 5-20 hours on a high end workstation. Since simulation is expensive, there 
is interest in being able to automatically and adaptively design the experiment to learn 
about where response is most interesting, e.g., where uncertainty is largest or the structure 
is richest, and spend relatively more effort sampling in these areas. However, before a clever 
sequential design strategy can be devised, a nonstationary model is needed that can capture 
the differences in behavior between subsonic and supersonic physical regimes. 

The surrogate model comm only used to a.pproxi mate outputs to computer experiments 



is the Gaussian process (GP) (jSantner et al. 



20031 ). The GP is conceptually straightfor- 



ward, easily accommodates prior knowledge in the form of covariance functions, and returns 
estimates of predictive confldence. In spite of its simplicity, there are two important disad- 
vantages to the standard GP in this setting. Firstly, the computation time for inference on 
the GP scales poorly with the number of data points, typically growing with the cube of 
the sample size. But most importantly, GP models are usually stationary in that the same 
covariance structure is used throughout the entire input space. In the application of high- 
velocity computational fluid dynamics, where subsonic flow is quite different from supersonic 
flow, this limitation is unacceptable. Therefore, the error (standard deviation) associated 
with a predicted response under a GP model does not locally depend on any of the previ ously 



2008ah was 



observed output responses. The Bayesian treed GP model (iGramacy and Led , 
designed to overcome these limitations. 

Ideally, we would like to be able to combine the treed GP model with classic model-based 
optimal design algorithms. However, classic design algorithms are ill-suited to partition mod- 
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els and Bayesian Monte Carlo-based inference. They are inherently serial and thus tacitly 
assume a controlled and isolated computing environment. The modern supercomputer has 
thousands of computing nodes, or agents, designed to serve a multitude of diverse users. If 
the design strategy is not prepared to engage an agent as soon as it becomes available, then 
that resource is either wasted or devoted to another process. If design is to be sequential 
(which it must, in order to learn about the responses online, and adapt the model), then the 
interface must be asynchronous, and any computation must execute in parallel. There may 
not be time to re-fit the model and compute the next optimal design. So the final ingredients 
in our flexible design framework are active learning strategies from the Machine Learning 
literature. Such strategies have been used as fast, Mo nte Carlo-friendly, approximate alter- 



natives to optimal sequential design (jSeo et al. 



2OOOI). 



Thus this paper makes two primary contributions: an integrated sequential design strat- 
egy for a modern nonstationary model, and methods for designing experiments in an asyn- 
chronous parallel computing environment. Our hybrid design strategy puts together the treed 
CP model, classic sequential design, and active learning, resulting in a highly efficient non- 
stationary model and sequential design combination that balances optimality and flexibility. 
The remainder of this paper is organized as follows. Section |2] reviews the main ingredients: 
from conventional optimal designs and active learning strategies with the canonical station- 
ary CP model, to the nonstationary treed CP model with Bayesian model averaging and 
hybrid sequential design. Section [3] details our approach to the sequential design of super- 
computer experiments with the treed CP surrogate model. Illustrative examples are given 
on synthetic data in Section |H The motivating example of a supercomputer experiment in- 
volving computational fluid dynamics code for designing a re-usable launch vehicle (LGBB) 
is described in Section [5l Section [6] offers some discussion and avenues for further research. 
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2 Review 

Our approach to the adaptive design of supercomputer experiments combines classic design 
strategies and active learning with a modern approach to nonstationary spatial modeling. 
These topics are reviewed here. 

2.1 Surrogate Modeling 

In a computer experiment, the simulation output z{x.), for a particular (multivariate) input 
configuration value x, is typically modeled as a zero mean random process with covariance 
C(x, x') = a^A'(x, x'). The stationary Gaussian process (GP) is a popular example of such 



a model, an d consequent 



experiments (jSacks et al. 



y is t 



1989 



le canonical surrog ate model used in designing computer 



Santner et al. 



20031). 



For data D = {'xj , Zi}^^^ = {X, Z} of mx^dimensional inputs X and scalar observations 
Z under a GP model, the density over outputs at a new point x has a Normal distribution 
with 



mean 



variance 



5(x) = k^(x)K-^Z, and 
a2(x) = a2[K(x,x) - kT(x)K-ik(x)] 



where k'''(x) is the A^-vector whose z**" component is A'(x, Xj), and K is the N x N matrix 
with i,j element i^'(xj,Xj). It is important to note that the uncertainty, (T^(x), associated 
with the prediction has no direct dependence on the nearby observed simulation outputs Z; 
because of the assumption of stationarity, all response points contribute to the estimation 
of the local error through their influence on the correlation function K{-, ■), and the induced 
correlation matrix Kjj- = ii'(xj,Xj). We follow Gramacy and Lee (2008) in specifying that 
K{-,-) have the form 

A'(xj, Xfclfif) = K*{yij, Xfc) + gSj^k, (2) 
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where S.^. is the Kronecker delta function, and K* is a true correlation function. The g 
term, referred to as the nugget, is positive {g > 0) and provides a mechanism for introducing 
measurement error into the stochastic process. It arises when considering a model of the 
form z(x) = u;(x) + ?7(x) where w(x) is the random process with covariance C, and ri{-) 
is independent Gaussian noise. Valid correlation functions K*{-,-) are usually generated 
as a member of a parametric family, such as the separable power or Matern fami l ies. A 



general reference for families of correlation functions K* is provided by 
Hereafter we use the separable power family. 



AbrahamsenI (119971 ). 



mx 



i^*(xj,Xfc|d) = exp - 



Po 



i=l 



di 



(3) 



which is a standard choice in modeling computer experiments ( ISantner et al. 
Po = 2 and infer the range pa rameters \di}^ as part of our es t imati on procedure. 



20031). We fix 



While many authors (e.g.. 



Santner et al. 



2003 



Sacks et al. 



19891 ) deliberately omit the 



nugget parameter on the grounds that computer experiments are deterministic, we have 
found it more helpful to include a nugget. Most importantly, we found that the LGBB simu- 
lator was only theoretically deterministic, but not necessarily so in practice. Researchers at 
NASA explained to us that their numerical CFD solvers are typically started with random 
initial values, and involve forced random restarts when diagnostics indicate that convergence 
is poor. Furthermore, due to the sometimes chaotic behavior of the systems, input configu- 
rations arbitrarily close to one another can fail to achieve the same estimated convergence, 
even after satisfying the same stopping criterion. Thus a conventional GP model without 
a small-distance noise process (nugget) can be a mismatch to such potentially non-smooth 
data. As a secondary concern, numerical s tability in d ecomposing covariance matrices can 



be improved by using a small nugget term (jNeal 



19971). 
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2.1.1 Treed Gaussian process model 

Because of concerns about the inadequacy of tlie stationarity assumption, we propose a 
surrogate model that is new in the realm of sequential design of exp eriments: the Bayesian 



treed Gaussian process (treed GP) model (IGramacy and Lee 



2008a|). The treed GP model 



extends the Bayesian treed linear model 



jy using a GP model with linear trend indep endently 



within each region, ins tead of constant (jChipman et al. 



(jChipman et al. 



1998 



Denison et al 



1998 



Chipman et al. 



or linear 



19981) is 



2002) models in the partitions. A process prior (j( 
placed on the tree T, and conditional on T, parameters for R independent GPs in regions 
{i^u}u=i are specified via a hierarchical generative model: 

Z,|/3„a^,K,~iV„,,(F,/3„aX) /3o ~ iV„^(/x, B) ~ JGKA gj2) (4) 



where F;^ = (1,X^), W is a m x m matrix, and m = irtx + 1. IG, and W are the 
(Multivariate) Normal, Inverse-Gamma, and Wishart distributions, respectively. Kj, is the 
separable power family covariance matrix with a nugget, as in ([2H3]). The data {X, Z},^ 
in region r^, are used to estimate the parameters 0^, = {(3 , a"^ , K , t"^} y of the model active 
in the region. Parameters to the hierarchical priors depend only on {0^}^^^. Samples 
from the posterior distribution are gathered using Markov chain Monte Carlo (MCMC). All 
parameters can be sampled using Gibbs steps, except for the covariance structure, whose 
parameters can be sampled via Metropolis-Hastings. 

The predicted value oi Z{pi. E Vy) is normally distributed with mean and variance 

i(x) =E(Z(x)| data,xer,) = ^(x)^, + k,(x)TK;i(Z, - F,3j, (5) 
= Var(Z(x)| data,x G r,) = a2[fi:(x,x) - qJ(x)C;iq,(x)], (6) 
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where C^' = (K, + r.^F^WFj)"! q,(x) = k,(x) + r^F,WJ{^) (7) 

«:(x, y) = ir.(x, y) + rJf^(x)Wf (y) 



with f~'^(x) = (Ijx"""), and k;^(x) a n,^— vector with k^j(x) = A''^(x, Xj), for all Xj G X,^. 
The global process is nonstationary because of the tree (T) and thus o"^(x) in ([6]) is region- 
specific. The predictive surface can be discontinuous across the partition boundaries of a 
particular tree T. However, in the aggregate of samples collected from the joint posterior 
distribution of {T,6}, the mean tends to smooth out near likely partition boundaries as 
the tree operations grow, pr une, change, awap, and rot ate integrate over trees and GPs with 



larger posterior probability (IGramacy and Lee 



2008a|). Uncertainty in the posterior for T 



translates into higher posterior predictive uncertainty near region boundaries. When the 
data actually indicate a non-smooth process, e.g., as in the LGBB experiment in Section [5l 
the treed GP retains the capability t o model discontinuities . 
The Bayesian treed linear model ofl< 



Chipman et al 



(I2OO2I ) is implemented as a special case 



of the treed GP model, called the treed GP LLM (short for: "with jumps to the Limiting Lin- 
ear Model"). Detection of linearity in the response surface is facilitated on a per-dimension 
basis via the introduction of mx indicator-parameters b^, in each region r^, = 1,...,R, 
which are given a prior conditional on the range parameter(s) to K^{-, ■). The boolean fe^j 
determines whether the GP or its LLM governs the marginal process in the z**" dimension of 
region Vi,. The result, through Bayesian model averaging, is an adaptively semiparametric 
nonsta tionary regression model which can be faster, more parsimonious, and numerically 



stable (IGramacy and Lee 



2008bl ). Empirical evidence suggests that many computer exper- 
iments involve responses which are either linear in most of the input dimensions, or entirely 
linear in a subset of the input domain [see Section [5]. Thus the treed GP LLM is particularly 
well-suited to be a surrogate model for computer experiments. 

Compared to other approaches to nonstationary modeling, including using spatial defor- 
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matio ns (ISampson 



tions (jHigdon et al. 



1999 



Paciorek 



and Guttord. 119921: ISchmidt and O'Haganl . 120031 ) and process convolu 



20031 ). the treed GP LLM approach yields an extremely 
fast implementation of nonstationary GPs, providing a divide-and-conquer approach to spa- 
tial modeling. Although the method is especially well-suited to axis-aligned nonstationarity, 



which is common in computer experiments, it has been found to compare favorab 



y in si tu- 



2008a 



Id). 



ations when the nature of the nonstationarity is more general (iGramacy and Lee 
For example, in Section 14.21 we consider a dataset where the treed GP is quite appropriate 
even though the correlation structure varies radially, and moreover, which is well-fit by a 
stationary model for small designs. In Section 14.31 we consider a high dimensional dataset 
where the nature of the nonstationarity is unknown. Software implementing the treed GP 
LLM model and all of its special cases (e.g., stat ionary GP, CART fc the treed line ar model. 



20041 ). and can 



linear model, etc.) is available as an R package (IR Development Core Team 
be obtained from GRAN: 

http : //www. cran. r-project . org/web/packages/tgp/ index .html. 

The package implements a family of default prior specifications, i.e., settings for the constants 
in (jlj). In this paper we use these defaults unles s otherwise no ted. For more d etails see the 



tgp documentation (IGramacy and Taddy 



20081 ) and tutorial (IGramacy 



2003). 



2.2 Sequential design of experiments 

In the statistics community, the traditional approach to sequential data solicitation is called 
(Sequential) Design of Experiments (DOE) or Sequential Design and 



puter Exyerinients 



Gurrin et al. 



1991 



SDAGE) when applied to computer s imulations (1 Sacks et al. 



Welch et al 



1992 



Santner et al 



Analysis o 


f Com- 


(Sacks et al. 




1989; 



20031 ). Depending on whether the 



goal of the experiment is inference or prediction, as described by a choice of utility, differ- 
ent algorithms for obtaining optimal designs can be derived. For example, one can choose 
the KuUback-Leibler distance between the posterior and prior distributions as a utility. 
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For Gaussian process models with correlation matrix K, this is equivalent to maximizing 
det(K). Subsequen t ly chosen input configurati ons are called maximum entropy designs (e.g., 



Shewry and Wynn 



1987 



Santner et al 



approaches to DOE is provided by 



20031. Chapter 6). An ex cellent review of Bayesian 



Chaloner and Verdinellil (119951 ) 



Finding optimal designs can be computationally intensive, especially for stationary GP 
surrogate models, because the algorithms usually involve repeated decompositions of large co- 
variance matrices. Determinant-space, for example, can have many local maxima which can 



be so ught-after via stochastic search, i.e., simulated annealing, genetic algorithms (IHamada et al. 



200ll ). etc. A parametric family is assumed, either with fixed parameter values, or a prelimi- 
nary analysis is used to find maximum likelihood estimates for its parameters, which are then 
treated as "known" . In a sequential design, parameters estimated from previous designs can 
be used, whereas a Bayesia n decision theoretic approach may "choose" a parameterization 



and optimal design jointly (IMuUer et al. 



20041 ). In all of these approaches, it is important 



to note that optimality is only with respect to the assumed parametric form. Should this 
form not be known a priori, as is often the case in practice, then the resulting designs could 
be far from optimal. 



Other nonparametric approaches used in the statistics literature include space fi 



1979 



Santner et al. 



l ing de 



20031). 



signs, e.g., maximin distance designs and LHS (iMcKay et al. 
Computing maximin distance designs can also be computationally intensive, whereas LHSs 
are easy to compute and result in well-spaced configurations r elative to random sa mpling, 



20031 ). LHSs 



though there are some degenerate cases, such as diagonal LHSs ( ISantner et al.l . 
can also be less advantageous in a sequential sampling environment since there is no mech- 
anism to ensure that the configurations will be well-spaced relative to previously sampled 
(fixed) locations. Maximum entropy designs, and maximin designs, may be more computa- 
tionally demanding, but they are easily converted into sequential design methods by simply 



fixing the locations of samples whose response has already been obtained, and then optimiz- 
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ing only over new sample locations. 



2.2.1 An active learning approach sequential experimental design 



In the world of Machine learning, design of experiments would (l oosely) fa l 



of a research focus called active learning. In the literature (lAngluin 



under the blanket 



1987 



Atlas et al. 



1990l ). active learning, or equivalently query learning or selective sampling, refers to the 
situation where a learning algorithm has some, perhaps limited, control over the inputs 
it trains on. There are essentially two active learning approaches to DOE using the GP. 
The first approach tries to maximize the information gained about model parameters by 
selecting from a set of candidates X, the location x G X which has the greatest standard 
deviation in predicted output. This approach, called ALM for Active Le arning-MacKay , 



has been shown to app r oxima te maximum expected information designs ( iMacKay 



Kleijnen and van Beerd (120041 ) take a similar approach 



19921 ). 



An alternative algorithm, called ALC for Active Learning-Cohn, is to select x G X 



maxi mizing the expected reduction in squared error averaged over the input space (jCohn 



19961 ). Using the notation from ([T]) for stationary GPs, and supposing that the location x is 
added into the design, a global reduction in predictive variance can be obtained by averaging 
over other locations y: 



Aa'fx 



A'^I(y)= / ^'(y)-'^l(y) = 

y Jy Jy 



a' [kT(y)K]vik(5c)-K(i,y 



A'('x,x) 



k(i)TK^ik(i) 



In practice the integral is replaced by a sum over a grid of locations Y, typically with Y = X, 
and the parame terization to the model, i.e., K{-, ■) and cr^, is assumed known in advance. 



Seo et al. 



(120001 ) provide a comparison for stationary GPs between ALC and ALM. 
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2.2.2 Other approaches to designing computer experiments 



Bayesian and no n-Bayesian 



iments abound ( 


Sacks et al. 


Sebastiani and Wvnn. 


2000 



a pproaches to surrogate modeling; and design 



1989 



Currin et al. 



1991 



Kennedy and O'Hagan 



Welch et al. 



2000 



1992 



br computer 



Bates et al. 



exper 



1996 



20011 ). A recent approach, which 



bears some similarity to ours, uses stationary GPs and a so-called syat ial aggregate lanquaqe 



to ai d in an active data mining of the input space of the experiment (IRamakrishnan et al 



20051 ). Our use of nonstationary surrogate models within a highly distributed supercomputer 
architecture distinguishes our work from the methods described in those papers, and yields 
a more dynamic approach to sequential design. 



3 Adaptive sequential design 

Much of the current work in large scale computer models starts by evaluating the model over 
a hand crafted set of input configurations, such as a full grid or some reduced design. After 
the initial set has been run, a human may identify interesting regions and perform additional 
runs if desired. We are concerned with automating this process, based on local estimates of 
uncertainty that can provided by the nonstationary treed GP LLM surrogate modeljll 

3.1 Asynchronous distributed super computing 

High fidelity supercomputer experiments are usually run on clusters of independent comput- 
ing agents, or processors. A Beowulf cluster is a good example. At any given time, each 
agent is working on a single input configuration. Multiple agents allow several input configu- 
rations to be run in parallel. Simulations for new configurations begin when an agent finishes 
execution and becomes available. Therefore, simulations may start and finish at different, 
perhaps even random, times. The cluster is managed asynchronously by a master controller 
( emcee ) program that gathers responses from finished simulations, and supplies free agents 



^We shall drop the LLM tag in what follows, and consider it implied by the label treed GP. 
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candidate 
queue 



Sampler J 



EMCEE 



( 


sim node 1 


) 


( 


sim node 2 


) 


( 


sim node 3 


) 




( 


sim node M 


) 



Figure 2: Emcee program gives finished simulations to the sampler (which populates the 
queue based on a surrogate model) and gets new ones from the queue. 



with new input configurations. The goal is to have the emcee program interact with a non- 
stationary modeling and sequential design program that maintains a queue of well-chosen 
candidates, and to which it provides finished responses as they become available, so that the 
surrogate model can be updated [see Figure [2] . 

3.2 Adaptive sequential DOE via active learning 

In the statistics community, there are a number of established methodologies for (sequen- 
tially) designing experiments [see Section [2^2] . However, some classic criticisms for traditional 
DOE approaches precluded such an approach here. The primary issue is that "optimally" 
chosen design points are usually along the boundary of the region, where measurement error 
can be severe, responses can be d ifficult to elicit, and model checking is often not feasi- 



ble (jChaloner and Verdinelli 



19951 ). Furthermore, boundary points are only optimal when 
the model is known precisely. For example, in one-dimensional linear regression, putting half 
the points at one boundary and half at the other is only optimal if the true model is linear; 
if it turns out that the truth may be quadratic, then forcing all points to the boundaries 
is highly suboptimal. Similarly here, where we do not know the full form of the model in 
advance, it is important to favor internal points so that the model (including the partitions) 
can be learned correctly. Other drawbacks to the traditional DOE approach include speed, 
the difficulty inherent in using Monte Carlo to estimate the surrogate model, lack of support 
for partition models, and the desire to design for an asynchronous emcee interface where 
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responses and computing nodes become available at random times. 

Instead, we take a two-stage (hybrid) approach that combines standard DOE with meth- 
ods from the active learning literature. The first stage is to use optimal sequential designs 
from the DOE literature, such as maximum entropy, maximin designs, or LHS, as candidates 
for future sampling. This ensures that candidates for future sampling are well-spaced out 
relative to themselves, and to the already sampled locations. In the second stage, the treed 
GP surrogate model can provide Monte Carlo estimates of region-specific model uncertainty, 
via the ALM or ALC algorithm, which can be used to populate, and sequence, the candi- 
date queue used by the emcee [see Figure [2] . This ensures that the most informative of the 
optimally spaced candidates can be first in line for simulation when agents become available. 

3.3 ALM and ALC algorithms 

Given a set of candidate input configurations X, Section [2.2. II introduced two active learning 
criteria for choosing amongst — or ordering — them based on the posterior predictive distri- 
buti on. ALM choo ses the x G X with the greatest standard deviation in predicted out- 



put (jMacKay 



19921 ). MCMC posterior predictive samples provide a convenient estimate of 
location-specific variance, namely the width of predictive quantiles. 

Alternatively, ALC selects t he x that ma ximizes the expected reduction in squared error 



averaged over the input space (jCohn 



19961 ). Conditioning on T, the reduction in variance 
at a point y G r,y, given that the location x G X^^ is added into the data, is defined as (region 
subscripts suppressed) : 

Aa|(y) = ^^(y) - a|(y) where ^^(y) = a'[K{y,y) - ql{y)C],WNiy)l 

and (T|(y) = (7^[K(y,y) - q^+i(y)C^^+iq7v+i(y)] 

using notation for the GP predictive variance for region given in (Q. Note that the + 
component of qAr+i(y), and the corresponding column and row of Cat+i, are a function of 
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X. The partition inverse equations (jBarnettl . 119791 ). for a covariance matrix Cat+i in terms 
of Cat, yield: 



..2t N ^ q^(y)Cjv qjv(x) - K(x,y) 
k(x,x) - q;,(x)C^ qAr(x) 

The details of this derivation are included in Appendix lA.li For y and x not in the same 
region r^, let AcxKy) = 0. Rather than integrating, as in ([H]), the reduction in predictive 
variance that would be obtained by adding x into the dataset is calculated in practice by 
averaging over a grid or candidate set of y G Y: 



Aa2(i) = |Y|-i J]Aa|(y) 
yeY 



(10) 



which can be approximated using MCMC methods. Compared to ALM, adaptive samples 
under ALC are less heavily concentrated near the boundaries of the partitions. Both provide 
a ranking of a set of candidate l ocations x G X. Com putational demands are in 0(|X|) for 



ALM, and 0(|X||Y|) for ALC. 



McKay et al. 



(119791 ) provide a comparison between ALM, 



and LHS, on computer code data. Seo et al. (2000) provide comparisons between ALC and 
ALM using standard CPs, taking Y = X to be the full set of un-sampled locations in a 
pre-specified dense uniform grid. In both papers, the model is assumed known in advance. 

However, that last assumption, that the model is known a priori is at loggerheads with 
sequential design — if the model were already known then why design sequentially? In the 
treed CP application of ALC, the model is not assumed known a priori. Instead, Bayesian 
MCMC posterior inference on {T,6} is performed, and then samples from A(j|(y) are 
taken conditional on samples from {T,0}. To mitigate the (possibly enormous) expense of 
sampling Acr|(y) via MCMC on a dense high-dimensional grid (with Y = X), a smaller and 
more cleverly-chosen set of candidates can come from the sequential treed maximum entropy 
design, described in the following subsection. The idea is to sequentially select candidates 
which are well-spaced relative both to themselves and to the already sampled configurations. 
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in order to encourage exploration. 

Applying the ALC algorithm under the limiting linear model (LLM) is computationally 
less intensive compared to ALC under a full GP. Starting with the predictive variance given 
in ([7]), the expected reduction in variance under the linear model is given in ffTTj) . below, and 
averaging over y proceeds as in ffTOl) . above. 

^^-^y> = l+, + fT(x)V,„f(x) 

Appendix IA.2I contains details of the derivation. The m x m matrix V^^ is the posterior 
variance of f3 based on the data points in the current design. Since only an 0{m^) inverse 
operation is required, Eq. ( fTTl) is preferred over replacing K with the N x N matrix 1(1 + g) 
in (Q, which requires an 0{N^) inverse. 

3.4 Choosing candidates 

We have already discussed how a large, i.e., densely gridded, candidate set X can make 
for computationally expensive ALM and (especially) ALC calculations. In an asynchronous 
parallel environment, there is another reason why candidate designs should not be too dense. 
Suppose we are using the ALM algorithm, and we estimate the uncertainty to be highest 
in a particular region of the space. If two candidates are close to each other in this region, 
then they will have the highest and second-highest priority, and the emcee could send both 
of them to agents. However, if we knew we were going to send off two runs, we generally 
would not want to pick those two right next to each other, but would want to pick two points 
from different parts of the space. If each design point could be picked sequentially, then the 
candidate spacing is not an issue, because the model can be re-fit and the points re-ordered 
between runs. In the reality of an asynchronous parallel environment, there may not be time 
to re-fit the model before the emcee needs an additional run configuration to send to another 
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agent. Thus there is a real need for well- spaced candidates. 



A sequential maximum entropy design (IShewry and Wynn 



1991 



Welch et al. 



1992 



Santner et al. 



1987 



Sacks et al. 



1989 



Currin et al. 



20031 . Chapter 6) may seem like a reasonable ap- 



proach because it encourages exploration. But traditional maximum entropy designs are 
based on a known parameterization of a single GP model, and are thus not well-suited to 
MCMC based treed partition models wherein "closeness" is not measured homogeneously 
throughout the input space. Furthermore, a maximum entropy design may not choose can- 
didates in the "interesting" part of the input space because sampling is high there already. 
E.g., in the rocket booster application we want to continue to sample close to Mach one 
because we need many more points to understand the function where it is changing quickly. 
Another disadvantage to maximum entropy designs is computational, requiring repeated 
decompositions of large covariance matrices. 

One possible solution to both computational and nonstationary modeling issues is to use 
what we call a (sequential) treed maximum entropy design. That is, a separate sequential 
maximum entropy design can be obtained in each of the partitions depicted by the maximum 
a posteriori (MAP) tree T. The number of candidates selected from each region, {f^}^^^ of 
T, can be proportional to the volume or the number of grid locations in the region. MAP 
parameters 0^1'^' can be used in creating the candidate design, or "neutral" or "exploration 
encouraging" parameters can be used instead. Separating design from inference by using 



custom parameterization s in design steps, rat 



the SDACE community (jSantner et al 



ler than inferred ones, is a common practice in 



20031 ). Small range parameters, for learning about 



the wiggliness of the response, and a modest nugget parameter for numerical stability, tend 
to work well together. 

Since optimal design is only used to select candidates, and is not the final step in adap- 
tively choosing samples, employing a high-powered search algorithm (e.g., a genetic algo- 
rithm) is unnecessary. Finding a local maximum is generally sufficient to get well-spaced 
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candidates. We use a simple stochastic ascent algorithKo in each of the R regions {ti,}^^-^ of 
T to find local maxima without calculating too many determinants. The R search algorithms 
can be run in parallel, and typically invert matrices much smaller than N x N. 



Sequential Treed Maximum Entropy Design for Exponential data 




Figure 3: Example of a treed maximum entropy design in 2-d. Open Circles represent 
previously sampled locations. Solid dots are the candidate design based on T, also shown. 



Figure [3] shows an example sequential treed maximum entropy design for the 2-d Ex- 
ponential data [Section 14. 2j . found by simple stochastic search. Input configurations are 
sub-sampled from a LHS of size 400, and the chosen candidate design is of size ~40 (i.e., 
[10%]). Dots in the figure represent the chosen locations of the new candidate design X 
relative to the existing sampled locations X (circles) . Candidates are reasonably spaced-out 
relative to one another, and to existing inputs, except possibly near partition boundaries. 
There are roughly the same number of candidates in each quadrant, despite the fact that 
the density of samples (circles) in the first quadrant is almost two-times that of the others. 
A classical (non-treed) maximum entropy design would have chosen fewer points in the first 
quadrant, where all the action is, in order to equalize the density relative to the other three 
quadrants. 



^For example, we find that the fohowing works weU for constructing a set of candidates X of size |X| — N'. 
Construct a LHS L of size lOiV', and initialize X to be a random subsample of L of size N' , without 
replacement. Then randomly propose to swap a single element of X with one from L \ X and accept only 
upon an observed increase in det(K([X, X])). Repeat until the acceptance rate is low, possibly with reference 
to N'. 
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3.5 Implementation methodology 

Bayesian adaptive sampling (BAS) proceeds in trials. Suppose that samples and their 
responses have been gathered in previous trials, or from a small initial design before the first 
trial. In the current trial, a treed GP model is estimated for data {'xJ,Zi}fLi. Samples are 
gathered in accordance with ALM or ALC conditional on {0,T}, at candidate locations X 
chosen to follow a sequential treed maximum entropy design, using the MAP tree obtained 
in the previous trial. The candidate queue is then populated with a sorted list of candidates. 
BAS gathers finished and running input configurations from the emcee and adds them into the 
design. Predictive mean estimates are used as surrogate responses for unfinished (running) 
configurations until the true response is available. New trials start with fresh candidates. 

An artificial clustered simulation environment, with a fixed number of agents, was devel- 
oped in order to simulate the parallel and asynchronous evaluation of input configurations, 
whose responses finish at random times. It was implemented Perl and was designed to 
mimic, and interface with, the Perl modules at NASA which drive their experimental de- 
sign software. Experiments on synthetic data, in the next section, will use this artificial 
environment. The LGBB experiment in Section [5] uses the real Perl modules to submit 
jobs to the real NASA supercomputer. Multi-dimensional responses, as in the LGBB exper- 
iment, are treated as independent. That is, each response has its own treed GP surrogate 
model, mz surrogates total for an m^-dimensional response. Uncertainty estimates (via 
ALM or ALC) are normalized and pooled across the models for each response in order 
to develop a single (sequential) design for the entire process. Treating highly correlated 
physical measurements as independent is a crude approach. However, it still affords remark- 
able results, and allows the use of the PThreads parallel computing library to get a highly 
parallel implementation and take advantage of multi-core processors that are becoming com- 
monplace. Cou£led__with__th^ model for parallelizing prediction and es- 



timation (IGramacy and Lee 



2008al ). a factor of 2m z speedup for 2mz processors can be 
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obtainedo Cokriging (IVer Hoef and Barry 



19981 ). co-regionalization (ISchmidt and Gelfand 



20031 ). and other approaches to modehng multivariate responses are obvious extensions, but 
he beyond the scope of the present work, and are not easily parallelizable. The MAP tree 
T, used for creating sequential treed maximum entropy candidates, is taken from the treed 
GP surrogates of each of the mz responses in turn. 

Chipman et al. (1998) recommend running several parallel chains, and sub-sampling from 
all chains in order better explore the posterior distribution of the tree (T). Rather than run 
multiple chains explicitly, the trial nature of adaptive sampling can be exploited: at the 
beginning of each trial the tree can be randomly pruned back. Although the tree chain 
associated with an individual trial may find itself stuck in a local mode of the posterior, in 
the aggregate of all trials the chain(s) explore the posterior of tree-space nicely. Random 
pruning represents a compromise between restarting and initializing the tree at a well- 
chosen starting place. This tree inertia usually affords shorter burn-in of the MCMC at the 
beginning of each trial. The tree can also be initialized with a run of the Bayesian treed LM, 
for a faster burn-in of the treed GP chain. 

Each trial executes at least B burn-in and T total MCMC sampling rounds. Samples 
are saved every E rounds in order to reduce the correlation between draws by thinning. 
Samples of ALM and ALC statistics need only be gathered every E rounds, so thinning 
cuts down on the computational burden as well. If the emcee has no responses waiting to 
be incorporated by BAS at the end of T MCMC rounds, then BAS can run more MCMC 
rounds, either continuing where it left off, or after re-starting the tree (but saving all samples 
from each chain). New trials, with new candidates, start only when the emcee is ready with 
a new finished response. Such is the design so that the computing time of each BAS trial 
does not affect the rate of sampling. Rather, a slow BAS runs fewer MCMC rounds per 



^For more information on the parallel implementation, please see Appendix C.2 of iGramacvl ( 20071 ) or 
the tgp package vignette. 
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finished response, and re-sorts candidates less often compared to a faster BAS. A slower 
adaptive sampler yields less optimal sequential samples, but still offers an improvement over 
naive gridding. In the experiments that follow in the next two sections, the MCMC for the 
surrogate model was run with B = 2000, T = 7000 and E = 2. 

4 Illustrative examples 

In this section, sequential designs are built for three synthetic datasets with the treed GP 
LLM as a surrogate model. The supercomputer was simulated as having five independent 
nodes which could provide responses for inputs in a time of 20 seconds plus a random number 
of seconds having a Poisson distribution with mean 20. The examples in this section use the 
default prior specification provided by the tgp package, which involves using an improper 
prior for /3 obtained by fixing /3q = and = oo in Eq. (j4]) of Section I2.1.1[ 

4.1 1-d Synthetic Sinusoidal data 

Co nsider some synthetic sinu soidal data first used by llligdon (2002), and then augmented 



by 



Gramacy and Led (l2008al ) to contain a linear region: 



sin(f) + lcos(^) x<10 
z{x) = { (12) 

x/10 — 1 otherwise, 

observed with A^(0, a = 0.1) noise. Figure|4]shows three snap shots, illustrating the evolution 
of BAS on this data using the ALC algorithm with sequential treed maximum entropy 
candidates. The first column shows the estimated surface in terms of posterior predictive 
means (solid-black) and 90% intervals (dashed-red). The MAP tree T is shown as well. The 
second column summarizes the ALM and ALC statistics (scaled to show alongside ALM) 
for comparison. Ten samples from a sequential maximum entropy design were used to start 
things off, and twenty candidates from a treed maximum entropy design were proposed 
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during each adaptive sampling round. 



Estimated Surface ALM and ALC stats 




5 10 15 20 5 10 15 20 




5 10 15 20 5 10 15 20 



Estimated Suiiace ALiUI and ALC stats 




5 10 15 20 5 10 15 20 



Figure 4: Sine data after 30 (top), 45 (middle), and 97 (bottom) adaptively chosen samples. 
Left: posterior predictive mean and 90% quantiles, and MAP partition T. Right: ALM 
(black-solid) and ALC (red-dashed). 

The snapshot in the top row of Figure H] was taken after BAS had gathered a total of 
30 samples, having learned that there is probably one partition near a; = 10, with roughly 
the same number of samples on each side. Predictive uncertainty (under both ALM and 
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ALC) is higher on the left side than on the right. ALM and ALC are in relative agreement, 
however the transition of ALC over the partition boundary is more smooth. The ALM 
statistics are "noisier" than ALC because the former is based on quantiles, and the latter 
on averages ffTOj) . Although both ALM and ALC are shown, only ALC was used to select 
adaptive samples. The middle row of Figure H] shows a snapshot taken after 45 samples were 
gathered, where we can see that BAS has sampled more heavily in the sinusoidal region (by 
a factor of two), and learned a great deal. ALM and ALC are in less agreement here than in 
the row above. Also, ALC is far less concerned with uncertainty near the partition boundary, 
than it is, say, near x = 1. Finally, the snapshot in the bottom row of Figure H] was taken 
after 97 samples had been gathered. By now, BAS has learned about the secondary cosine 
structure in the left-hand region. It has focused almost three times more of its sampling 
effort there. ALM and ALC agree that there is high uncertainty near the partition boundary 
(T), but otherwise disagree about where to sample next. Any further sampling would yield 
only marginal improvements since this final surface, in the bottom-left panel, is a very good 
approximation to the truth. 

In summary, the left panels of Figure H] track the treed CP model's improvements in its 
ability to predict the mean, via the increase in resolution from one figure to the next. From 
the scale of y-axes in the right column one can see that as more samples are gathered, the 
variance in the posterior predictive distribution of the treed CP decreases as well. Despite 
the disagreements between ALM and ALC on individual iterations during the evolution of 
BAS, it is interesting to note that difference between using ALC and ALM on this dataset is 
negligible. This is likely due to the high quality of the candidates X, from a sequential treed 
maximum entropy design, which prevent the clumping behavior that tends to hurt ALM, 
but to which ALC is somewhat less prone. 

Perhaps the best illustration of how BAS l earns and adapt s over time is to compare it 



to something that is, ostensibly, less adaptive. 



Gramacy et al. 



(120041 ) show how the mean- 
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squared error (MSE) of BAS evolves over time on a similar dataset, but in a serial setting 
where only one sample is taken at a time, and the surrogate model is allowed to re-fit before 
the next (single) adaptive sample is chosen. They show how the MSE of BAS decreases 
steadily as samples are added, despite that fewer points are added in the linear region, 
yielding a sequential design strategy which is two times more efficient than LHS. They also 
show how BAS measures up against ALM and ALC, as implemented by Seo et al. (2000) — 
with a stationary GP surrogate model. Seo et al. make the very powerful assumption that 
the correct covariance structure is known at the start of sampling. Thus, the model need not 
be updated in light of new responses. Alternatively, BAS quickly gathers enough samples to 
learn the partitioned covariance structure, after which it outperforms ALM and ALC based 
on a stationary model. 

4.2 2-d Synthetic Exponential data 

The nonstationary treed GP surrogate model has an even greater impact on adaptive sam- 
pling in a higher dimensional input space. For an illustration, consider the domain [—2, 6] x 
[—2,6] wherein the true response is given by z(x) = a;iexp(— a;^ — x^), observed with 
iV(0, a = 0.001) noise. We take an initial set of 16 configurations from a maximum en- 
tropy design, and twenty new candidates (from a sequential treed maximum entropy design) 
are used in each adaptive sampling round. The top row of Figure |5] shows a snapshot after 
30 adaptive samples have been gathered with BAS under the ALC algorithm. Room for 
improvement is evident in the mean predictive surface {left column). The second column 
shows the ALC surface, and the single partition of T, with samples evenly split between the 
two regions. Observe in the ALC plot that the model also considers a tree with a single split 
along the other axis, indicating good mixing of the reversible jump Markov chain in tree 
space. 

After 50 adaptive samples have been selected {second-row of Figure [5]), the situation is 
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Posterior Predictive: z mean 



Expected Reduction in Squared Error: z ALC stats 





Posterior Predictive: z mean 



Expected Reduction in Squared Error: z ALC stats 





Posterior Predictive: z mean 



Expected Reduction in Squared Error: z ALC stats 




Figure 5: Exponential data after 30 (top), 50 (middle) and 80 (bottom) adaptively chosen 
samples. Left: posterior predictive mean surface; Right: ALC criterion surface, with MAP 
tree T and samples X overlayed. 
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greatly improved. ALC asserts that the first quadrant is most interesting, and as a result 
(adaptive) sampling is higher in this region. The dots in Figure [3] illustrate the candidates 
from a sequential treed maximum entropy design used during this round. [Notice that the 
X locations (circles) in Figure [3] match the dots in the center row of Figure O] Finally, the 
bottom row of Figure [5] shows the snapshot taken after 80 adaptive samples. More than 
54% of the samples are located in the first quadrant which occupies only 25% of the total 
input space. As before, the final surface shown in the bottom-left of the figure is a very good 
approximation to the truth. 

When comparing to LHS, ALM, and ALC with statio n ary G Ps, much the same can 



be said here as with the sinusoidal data. 



Gramacy et al. 



fl2004f ) show that the MSB of 



BAS decreases steadily as samples are added in a serial fashion, despite that most of the 
sampling occurs in the first quadrant, and that it is at least two-times more efficient than 
LHS. Crucially, the exponential data are not defined by step functions, in contrast with the 
sinusoidal data. Transitions between partitions are smooth. Thus it takes BAS longer to 
learn about T — which in this case can be thought of as a design tool rather than a model 
assumption (since the data are well fit by a stationary CP) — and the corresponding three CP 
models in each region of T. Once it does however (after about 50 samples) BAS outperforms 
the (in hindsight) well-parameterized stationary model with ALM. 

To highlight the benefits of using a treed model in sequential design, we consider a deeper 
comparison on this data with results summarized in Table [H The comparison involves 
combinations of five models: Bayesian CART, treed linear models, (stationary) CP, treed 
CP, and treed CP LLM; three ways of generating AS candidates: LHS, maximum entropy 
and treed maximum entropy; and two adaptive sampling heuristics: ALC and ALM. The 
table shows RMSE to the truth as evaluated on a dense grid, for 30 repeated BAS runs each 
starting with a random initial maximum entropy design of 20 configurations, and then 55 
samples chosen adaptively (for 75 total). For fairness, the final RMSE calculation (in each 
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Table 1: Comparing a combination of five models: Bayesian CART, treed linear models, 
(stationary) CP, treed CP, and treed CP LLM (labeled bcart, btlm, bgp, btgp, btgpllm); 
three ways of generating AS candidates: LHS, sequential maximum entropy and sequential 
treed maximum entropy (Ih, me, tme); and two AS heuristics ALC and ALM, in terms of 
RMSE to the truth. Observe that the bgp/tme combination is not run since the stationary 
CP model does not provide a MAP tree as is required for sequential treed maximum entropy 
design. Therefore there are 28 combinations instead of 30. 

case) is based on the predictive means sampled from a full treed CP LLM model on a dense 
grid of predictive locations, regardless of the method used for sequential design. The table 
is sorted on the fourth column (RMSE). Somewhat surprisingly, Bayesian CART does really 
well if its candidates come from a sequential treed maximum entropy designO Also, ALM 
and ALC perform about equally as well as one another on this data, though we suspect that 
ALC would do better than ALM if the data were heteroskedastic, in which case ALM would 
concentrate samples in the high noise region even if the mean in that region is tame. Finally, 
LHS candidates do better than ones from a sequential maximum entropy design (with the 
notable exception of Bayesian CART) . The non-treed (stationary) CP model and non-treed 

*Note that the using the full treed GP LLM for the RMSE calculation is crucial here. Had Bayesian 
CART been used instead it would have been ranked much lower. 
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maximum entropy designs are the worst in the study. That the treed GP does better than 
the treed GP LLM is perhaps to be expected as there is nothing at all linear about this 
dataset. 

4.3 Six-dimensional example 

As an example of a higher- dimensional problem, we present a 6-d example, with true response 

z{xi,X2, X3, X4, X5, Xq) = exp {sin ([0.9 * (xi + 0.48)]^°) } + X2X3 + X4 . (13) 

This function has four active variables. It is of continuously varying wiggliness in the first 
dimension where the smoothness varies over the space without any natural threshold. The 
treed GP will usually partition on this dimension, typically somewhere between 0.6 and 
0.85, which will allow more of the adaptive sampling effort to be put on the more quickly 
oscillating part near xi = 1. The response is smooth but non-linear in the second and third 
dimensions, and linear in the fourth dimension. The final two variables are pure noise, which 
the treed GP will need to learn about adaptively. 

method Avg(rmse) SE(rmse) 

btgpllm-linburn/tme/alc 0.02871943 0.0006446596 

btgpUm/tme/alc 0.03217273 0.0037997994 

no adaptive samphng 0.03598135 0.0034438090 

Table 2: RMSE to the truth as evaluated on random LHSs of size 1000 in [0, 1]^, summarized 
for 10 repeated AS runs on the 6-d example. 

Our experiment allows the inputs to vary in [0, 1]^ and the response in fll3l) is observed 
with A^(0, a = 0.05) noise. We used a similar artificial clustered simulation environment 
to the one described in Section 13.51 At any time there are five nodes available to evaluate 
responses, which finish in no sooner than 3 minutes, plus a random number of seconds 
distributed as Pois(180). Table [2] shows average RMSE to the truth as evaluated on 10 
random LHSs of size 1000 in [0, 1]^ (with standard errors), for 10 repeated BAS runs. Each 
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run starts with a random initial set of 400 configurations from a maximum entropy design, 
and then 400 samples are chosen adaptively. We compare the Bayesian treed GP LLM model 
with ALC, both with and without LM burn-in, to a non-adaptive maximum entropy design 
of size 800. A stationary (non-treed) GP model was excluded from the comparison due 
to time constraints. As before, we use the full treed GP LLM model to calculate RMSEs 
after the adaptive sampling run(s), for fairness. Though there is some variation across the 
10 runs, the RMSE values obtained in each run were always row-ordered as they are, in 
the table, with the non- adaptive method coming last, and the treed GP LLM version with 
linear burn-in and ALC coming first. That is, although the differences between the average 
RMSEs in the table appear to be modest, the improvement obtained by BAS is statistically 
significant. 

5 LGBB CFD experiment 

The final experiment is our motivating example, a computational fluid dynamics simulator of 
a proposed reusable NASA launch vehicle, called the Langley Glide-Back Booster (LGBB). 
Three input parameters are varied (side slip angle, speed, and angle of attack), and for 
each setting of the input parameters, six outputs (lift, drag, pitch, side-force, yaw, and roll) 
are monitored. All six responses are computed simultaneously. In a previous experiment, 
a supercomputer interface was used to launch runs at over 3,250 input configurations in 
several hand-crafted batches. Figure [1] plots the resulting lift response as a function of 
Mach (speed) and alpha (angle of attack), with beta (side-slip a ngle) fixed to zero. A more 



detailed description of this system and its results are p rovided by 



preliminary adaptive sampling of this data appeared in 



Rogers et al. 



Gramacy et al 



(120031). Some 



(|2004J ). although that 



paper dealt with only a single output, considered only one-at-a-time updates, involved only 
a simulation of a computer experiment, and only resampled points from the hand-crafted 
initial run of 3,250. Here we describe the development of a live sequential design in the 
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fully asynchronous NASA environment. In a separate, non-adaptive, analysis of this data 



Gramacy and Led (l2008al ) noticed that the noise structure was heteroskedastic, so we have 
chosen to use the ALC statistic to guide the adaptive sampling. 

BAS for the LGBB is illustrated pictorially by the remaining figures in this section. The 
experiment was implemented on the NASA supercomputer Columbia — a fast and highly 
parallelized architecture, but with an extremely variable workload. The emcee algorithm of 
Section [3TT] was designed to interface with AeroDB, a database queuing system used by NASA 
to submit jobs to Columbia, and a set of CFD simulation codes called cartSd. To minimize 
impact on the queue, the emcee was restricted to ten submitted simulation jobs at a time. 
Candidate locations were sub-sampled from a 3-d grid consisting of 37,909 configurations. 
The design was initialized with 30 candidates from a maximum entropy design, and 100 new 
candidates (from a treed maximum entropy design) were proposed during each AS round. 
We use full hierarchical prior for /3, described by Eq. (j4j) in Section 12.1.11 by augmenting the 
default with the augment bprior = "bO". The pooling of means implied by the hierarchical 
prior is appropriate for this data since it is believed that the vast expanse of the response 
surface — for speeds greater than Mach 1 — is largely homogeneous. 

Figure [6] shows the 780 configurations sampled by BAS for the LGBB experiment. The left 
panel shows locations as a function of Mach (speed) and alpha (angle of attack), projecting 
over beta (side slip angle); the right panel shows Mach versus beta, projecting over alpha; 
the middle panel shows the beta = slice. NASA recommended treating beta as discrete, 
so we used a set of values which they provided. We can see that most of the configurations 
chosen by BAS were located near Mach one, with highest density for large alpha. Samples 
are scarce for Mach greater than two and are relatively uniform across all beta settings. A 
small amount of random noise has been added to the beta values in the plots (bottom-left) 
for visibility purposes. 

After samples are gathered, the treed GP model can be used for Monte Carlo estimation 
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Adaptive Samples, beta projection Sampled Input Configurations (beta=0) & Partitions 
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Figure 6: Adaptively sampled configurations projected over beta (side-slip angle; top-left), 
for fixed beta = (top-right) with MAP partition T, and then over alpha (angle of attack; 
bottom-left) . 

of the posterior predictive distribution. The upper-left plot in Figure [7] shows a slice of the 
lift response, for fixed beta plotted as a function of Mach and alpha. [The upper-right panel 
of Figure [6] shows the corresponding adaptively sampled configurations and MAP tree T 
(for beta = 0).] The MAP partition separates out the near-Mach-one region. Samples are 
densely concentrated in this region — most heavily for large alpha. 

Figure [7] shows posterior predictive surfaces for the remaining five responses as well. Drag 
and Pitch are shown for the beta = slice. Other slices look strikingly similar. Side, yaw, 
and roll are shown for the beta = 2 slice, as beta = slices for these responses are essentially 
zero. All six responses exhibit similar characteristics, in that the supersonic cases are tame 
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Mean posterior predictive — Lift 
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Mean posterior predictive — Drag 
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Mean posterior predictive — Yaw 
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Figure 7: LGBB slice of mean posterior predictive surface for six responses {lift, drag, pitch, 
side, yaw, roll) plotted as a function of Mach (speed) and Alpha (angle of attack) with Beta 
(side slip angle) fixed at zero for the first three responses, and two for the last three. 
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relative to their subsonic counterparts, with the most interesting region occurring near Mach 
1, and for large angle of attack (alpha). The treed GP model has enabled BAS to key in 
on this phenomenon and concentrate sampling there (Figure [6]). Compared to the initial 
experiment, BAS reduced the simulation burden on the supercomputer by more than 75%. 



6 Conclusion 

We showed how the treed GP LLM can be used as a surrogate model in the sequential design 
of computer experiments. A hybrid approach, combining active learning and classical design 
methodologies, was taken in order to develop a flexible system for use in the highly variable 
environment of asynchronous agent-based supercomputing. Two sampling algorithms were 
proposed as adaptations to similar techniques developed for a simpler class of models. One 
chooses to sample configurations with high posterior predictive variance (ALM); the other 
uses a criterion based on an average global reduction in uncertainty (ALC). These model 
uncertainty statistics were used to determine which of a set optimally spaced candidate 
locations should go out for simulation next. Optimal candidate designs were determined 
by adapting a classic optimal design methodology to Bayesian partition models. The result 
is a highly efficient Bayesian adaptive sampling (BAS) strategy, representing a significant 
improvement on the state-of-the-art of computer experiment methodology at NASA. ALM, 
ALC, and treed maximum entropy design are implemented in the tgp package for R available 
on CRAN. Code for adaptive sampling via an asynchronous supercomputer (emcee) interface 
is available upon request. 

There are some enhancements which can be made towards applying the methods herein to 
a broa der array of proble ms. Three such clo sely related problern s are of sampling to find ex- 



trema (jJones et al. 



19981 ) , to find contours 



aries, i.e., contours with large gradients (IBanerjee and Gelfand 



(IRanian et al 



20081) g e nera. 



ly, or to find bound- 



20061 ). a.k.a. Wombling. 



Other related problems include that of learning about, or finding extrema in, computer ex- 
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periments with multi-fidelity codes of varying execution co s ts (|Huang et all 120061 ) . and those 
which are paired with a physical experiment (IReese et al.l . 12004 ) . 
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A Active Learning — Cohn (ALC) 

Section [A. II derives ALC for the hierarchical CP and Section [A. 21 does the same for the LLM. 
A.l For hierarchical Gaussian processes 



The partition inverse equations (iBarnettl . Il979l ) can be used to write a covariance matrix 



Cat+i in terms of Cat, so to obtain an equation for Cjy+i in terms of Cj^: 



T 



m 



K 



p-1 



[Cm' 



(14) 



where m = [C(xi, x), . . . , C(xAr, x)], k = C(x, x), for an -|- point x where C(-, ■) is the 



covariance function, 



/iC^^m, and /i 



m"''Cjy"'^m) ^. If C^^ is available, these 



partitioned inverse equations allow one to compute C^^^, without explicitly constructing 
Cat+i (in 0{n^) rather than the usual 0(n^)). 

In the context of ALC sampling from the model in Eq. (jl]), the matrix which requires an 
inverse is Kat+i -I- Fat+iWF^.^]^, to calculate the predictive variance a"^(x). 
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N+l 



K 



N 



k7v(x) 



k;(x) A'(x,x) 



K 



N 



F^WF;^ FjvWf(x) 
f(x)^WFT f(x)TWf(x) 

k;v(x) + F;vWf(x) 



kT (x) + f (x)TWFT K(x, x) + f (x) ' Wf (x) 



(*) Using the notation Cat = Kn + FatWF^, qAr(x) = kAr(x) + FArWf(x), and k(x, y) 
A'(x, y) + f(x)^Wf(y) yields some simplification: 



■'N+l 



C 



N 



q7v(x) 
'^(x,y) 



Applying the partitioned inverse equations (1141) gives 



N+l 



FT^iWF^+i)-i 



(15) 



where g = — yuC^"'^qAr(x), and fi = (k(x, x) — qAr(x)^C^"'^qAr(x)) ^ from (*). We can now 
calculate the reduction in variance at y given that x is added into the data: 

AaJ(x) = a2(y)-a^(y), 
where a^{y) = a^[K{y,y) - q^(y)C^^q^(y)], 

and a-^(y) = a^[K{y,y) - qN+i{yVCjf\i(lN+i{y)]- 

Now ActJW = (r^[K{y,y) - q^(y)C^^qjv(y)] - (T^Hy^y) - q^+i(y)C^l^iqjv+i(y)] 
= (T^[qN+i{yVc^\iqN+i{y) - q^(y)C^^q7v(y)]. 

Focusing on qJ^^-^^{y)C'jy\^qN+i{y), first decompose qAr+i: 
Fjv+iWf(y) 



qjv+i = kAr+i(y) 4 
k7v(y) 

i^(y,x) 



+ 



Fat 



Wf(y) 



k7v(y) + FjvWf(y) 
if(y,x) + r(x)Wf(y) 





qw(y) 




'^(x,y) 
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Turning attention back to C^l^^q„+i(y), with the help of ( fTSl) : 

Cjv+iqjv+i(y) = 



qAf(y) 




«(x,y) 





[c^^ + ggV ^]q^(y) + g/^(x,y) 
g^qiv(y) + /^K(x,y) 



q7v+i(y)c^Viq^+i(y) 



qiv(y) 


1 


ft^lxjy) 





[C^^ + gg^/i-')q^(y) + gft:(x, y)) 

g^qiv(y) + /^fi;(x,y) 

q^(y) [(C^' + gg^ f^~')ciN{y) + gft:(x, y)] 
+ '^(x,y)[g^q7v(y) +/i/i(x,y)]. 



Finally 



Aa^(x) 



A^J(x) 



f^^[q^+i(y)^C^+iq7v+i(y) - q^(y)C^^qjv(y)]. 

c^^[q^(y)gg^/i^^q7v(y) + 2k(x, y)g^q7v(y) + /i«;(x, y)^] 

(^^ [qjv(y)C^^qjv(x) - K(x,y)'^ 



x,x) -qT(x)C^^q^(x) 



A. 2 For hierarchical (limiting) linear models 

Under the (limiting) linear model, computing the ALC statistic is more straightforward. 
AaJ(x) = a'{y) - aUy) = a'[l - f^{y)Y^J{y) - 1 - f^(y)V^^^^f (y)] 

= ^'f^(y)[v^.-v^,,Jf(y), 



where V^^^^ from (IGramacy and Led . l2008ar i includes x, and V^^ does not. Expanding out 
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A^J(x) 



-^f^(y) 



V 



N+l 



f^(y) 



V 



w 



-1 



V 



r2 1 + ^ 





T 






-1- 


Fat 




Ftv 










) 











f(y) 



V 



I3n 



f;f^ , f(x)f'(x) 



V 



I3n 



/3iv 



f(x)f^(x) 
^ + 9 



-1 



f(y) 



f(y)- 



The Sherman-Morrison- Woodbury formula (iBernsteinl . l2005l ). where V = f'''(x)(l + 



and A = V-^ gives 

PN 



Aa^(x) = a^f'(y) 



1 + 



f^fx)V^^f(x 



f(x)r(x) 

w 1 -\- g 



f(y) 



^^[f"(y)v^.f(x)]^ 
^ i + ^^ + r(x)v^^f(x) 
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